Title
This project aims to help investors select stocks for investment using principal component analysis (PCA). PCA helps to uncover hidden patterns and associations in data, and its applications span finance, data science, data analysis, machine learning, and other fields. PCA transforms a dataset into a new set of variables (known as principal components) that are uncorrelated with one another.
The original dataset may contain several variables that are correlated with one another. PCA reduces the dimensionality of the data while retaining the information that best represents the entire data set. To achieve this, it computes new variables, called principal components, as linear combinations of the original variables.
The first principal component is required to have the largest possible variance. The second component is computed under the constraint that it is orthogonal to the first, and it has the largest possible variance among all such directions. The other components are computed likewise. The values of these new variables for the observations are called factor scores, and they can be interpreted geometrically as the projections of the observations onto the principal components.
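To make the description above concrete, here is a minimal sketch of PCA computed by hand with NumPy. It uses synthetic data (not the stock returns), and the variable names are illustrative assumptions. It checks the three properties described: components are ordered by variance, they are orthogonal, and the factor scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data, not the stock returns

# Two correlated variables plus one independent variable (200 observations)
x = rng.normal(size=200)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200), rng.normal(size=200)])

# Centre the data, then eigendecompose its covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Sort the components from largest to smallest variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Factor scores: projections of the observations onto the components
scores = Xc @ eigvecs

print("Variance of each component:", scores.var(axis=0, ddof=1))
print("Components orthogonal:", np.allclose(eigvecs.T @ eigvecs, np.eye(3)))
print("Scores uncorrelated:", np.allclose(np.cov(scores, rowvar=False), np.diag(eigvals)))
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is why sorting the eigenvalues puts the component with the largest variance first.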
import pandas as pd # Used for data manipulation
import yfinance as yf # Used for downloading historical data
from sklearn.decomposition import PCA # Used for performing principal component analysis
import seaborn as sns # Used for data visualisation
import sweetviz as sv # Used for exploratory data analysis (EDA)
import numpy as np # Used for numerical computation
import matplotlib.pyplot as plt # Also used for data visualization
from sklearn import preprocessing # Used to scale the data before performing the principal component analysis
# List of stock tickers to download data
tickers = ['AAPL', 'GOOG', 'MSFT', 'AMZN', 'INTC', 'CSCO', 'CMCSA', 'PEP', 'GILD',
           'NVDA', 'ORCL', 'T', 'VZ', 'PFE', 'HD', 'UNH', 'MO', 'WBA', 'BMY', 'MRK',
           'NEE', 'DUK', 'EXC', 'SRE', 'SO', 'D', 'DOW', 'XOM', 'CVX', 'BAC', 'JPM',
           'C', 'V', 'MA', 'WFC', 'USB', 'PNC', 'BK', 'MET', 'TRV', 'DIS', 'MCD', 'PYPL',
           'COST', 'AMGN', 'UNP', 'BA', 'KO', 'MCO', 'IBM', 'LMT', 'GS', 'AAL', 'AEP',
           'AWK', 'DTE', 'ETN', 'EIX', 'NSC', 'PCG', 'PNW', 'PPL', 'PXD', 'VTR',
           'WEC', 'XEL', 'XRX']  # 67 unique tickers (duplicate 'HD' and 'NEE' entries removed)
# Download data for all tickers
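# auto_adjust=False keeps the 'Adj Close' column, which newer yfinance releases may omit by default
data = yf.download(tickers, start='2019-01-01', end='2021-12-31', auto_adjust=False)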
# Select the Adj Close column for all tickers
adj_close = data['Adj Close']
# Calculate the daily returns
returns = adj_close.pct_change()
# Print the first three rows of the data frame (returns)
returns.head(3)
# Replace missing returns with zero (the first row has no previous day to compare against)
returns.fillna(0, inplace=True)
# Check the shape of the data for verification
returns.shape # The data consists of 756 rows and 67 columns (stocks in this project).
returns.head(3) # Look at the first three rows again for verification.
# Generate an exploratory data analysis (EDA) report for the returns
analyze_report = sv.analyze(returns)
# Display the EDA report inside the Jupyter notebook
analyze_report.show_notebook()
6. Transform the data to have a mean of zero (0) and a standard deviation of one (1).
Principal component analysis is sensitive to the scale of the variables. Variables with larger values tend to dominate the analysis and can make the results misleading. To give the variables equal weight, they should be scaled.
Scaling also helps reveal the underlying structure in the data. As the exploratory data analysis (EDA) above shows, the values of the variables differ significantly from one another, so they should be scaled.
In this project, the data (returns) is scaled and stored in a new array called s_returns.
s_returns = preprocessing.scale(returns)
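As a small illustration of why scaling matters, the sketch below compares the share of variance along each principal axis before and after standardising. It uses synthetic data and a hypothetical helper, `variance_ratios`, neither of which is part of the project. The manual standardisation (subtract the column mean, divide by the column standard deviation) is assumed to mirror what `preprocessing.scale` does with its default settings.

```python
import numpy as np

rng = np.random.default_rng(1)  # synthetic data for illustration only

# Two independent variables on very different scales
X = np.column_stack([rng.normal(scale=1.0, size=500),
                     rng.normal(scale=100.0, size=500)])

def variance_ratios(data):
    """Share of total variance along each principal axis, largest first."""
    eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# Standardise each column to mean 0 and standard deviation 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_ratio = variance_ratios(X)
scaled_ratio = variance_ratios(X_scaled)
print("Unscaled:", raw_ratio)
print("Scaled:  ", scaled_ratio)
```

Without scaling, the first component is almost entirely the large-scale variable. After scaling, the two independent variables contribute roughly equally.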
# Fit a PCA model with four components and project the scaled returns onto them
pca = PCA(n_components=4)
pca.fit(s_returns)
returns_transformed = pca.transform(s_returns)
# Store the principal component scores in a dataframe (finpca) and label the columns accordingly
finpca = pd.DataFrame(returns_transformed, columns=['pc1', 'pc2', 'pc3', 'pc4'])
# Preview the first four rows
finpca.head(4)
10. Check the level of variation in the data the four principal components have been able to capture.
In this instance, the four components capture almost 68% of the variation (information) in the data. This is a good sign, so the four components are retained for the rest of the analysis.
The variance captured by each component is also shown in a scree plot.
print('Ratio of variance explained by each of the four principal components:', pca.explained_variance_ratio_)
print('Total ratio of variance explained by the four principal components:', pca.explained_variance_ratio_.sum(),
      'or', pca.explained_variance_ratio_.sum() * 100, '%')
# Visualize the percentage of variance captured by each of the four principal components on a scree plot
# Set the plot theme to "darkgrid" before creating the figure so that it takes effect
sns.set_style("darkgrid")
# Get the percentage of variance explained by each component
var_explained = pca.explained_variance_ratio_ * 100
# Number the components 1, 2, 3, 4 for the x-axis
component_numbers = list(range(1, len(var_explained) + 1))
# Create a figure and an Axes object
fig, ax = plt.subplots()
# Plot the explained variance as a bar plot
ax.bar(component_numbers, var_explained)
ax.set_xticks(component_numbers)
# Label the axes and add a title
ax.set_xlabel("Principal Component")
ax.set_ylabel("Explained Variance %")
ax.set_title("Scree Plot")
# Show the plot
plt.show()
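A complementary check to the scree plot is the cumulative explained variance, which shows how many components are needed to reach a chosen share of the total. The sketch below uses made-up ratios and an arbitrary 80% threshold purely for illustration. With a fitted model, `pca.explained_variance_ratio_` would take the place of the hypothetical `ratios` array.

```python
import numpy as np

# Hypothetical explained-variance ratios (NOT the results of this project)
ratios = np.array([0.5, 0.2, 0.15, 0.1, 0.05])

# Running total of variance explained as components are added
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.8  # arbitrary target chosen for this example
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative explained variance:", cumulative)
print("Components needed to reach", threshold, ":", n_needed)
```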
# Extract the loadings (the weight of each stock in each principal component)
loadings = pca.components_
variable_names = list(returns.columns)
# Index names used in the loadings dataframe (loadings_df)
indexname = ['first_pc_loadings', 'second_pc_loadings', 'third_pc_loadings', 'fourth_pc_loadings']
# Store the loadings in a dataframe
loadings_df = pd.DataFrame(loadings, columns=variable_names, index=indexname)
loadings_df.head(5)
13. Select the maximum absolute value for each of the four principal component loadings.
For each principal component, the stock with the largest absolute loading is selected to represent that component.
MET (MetLife, Inc.), WEC (Wisconsin Energy Corporation), AMZN (Amazon.com, Inc.), and GILD (Gilead Sciences, Inc.) are the stocks selected to represent the four principal components.
These selected stocks stand in for all 67 stocks. As an investor, you can buy these four stocks rather than a large number of individual stocks, because the components they represent capture most of the variation in the other 63 stocks.
# For each principal component (row), find the stock with the largest absolute loading
loadings_df.abs().idxmax(axis="columns")
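The selection rule can be seen on a toy table. The sketch below uses hypothetical loadings and made-up tickers (AAA, BBB, CCC), not values from this project, to show why the absolute value is taken first: a large negative loading is as influential as a large positive one.

```python
import pandas as pd

# Hypothetical loadings for two components and three made-up tickers
toy_loadings = pd.DataFrame(
    [[0.10, -0.90, 0.42],
     [0.75, 0.20, -0.63]],
    index=["first_pc_loadings", "second_pc_loadings"],
    columns=["AAA", "BBB", "CCC"],
)

# Taking absolute values first keeps large negative loadings in contention
representatives = toy_loadings.abs().idxmax(axis="columns")
strengths = toy_loadings.abs().max(axis="columns")

print(pd.DataFrame({"ticker": representatives, "abs_loading": strengths}))
```

Here BBB represents the first component even though its loading is negative, and AAA represents the second.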
NOTE:
Here are the names of the sixty-seven (67) companies whose stocks were used in this project.
AAL: American Airlines Group Inc
AAPL: Apple Inc
AEP: American Electric Power Company Inc
AMGN: Amgen Inc
AMZN: Amazon.com, Inc
AWK: American Water Works Company Inc
BA: The Boeing Company
BAC: Bank of America Corporation
BK: The Bank of New York Mellon Corporation
BMY: Bristol-Myers Squibb Company
C: Citigroup Inc
CMCSA: Comcast Corporation
COST: Costco Wholesale Corporation
CSCO: Cisco Systems, Inc
CVX: Chevron
D: Dominion Energy Inc
DIS: The Walt Disney Company
DOW: Dow Inc
DTE: DTE Energy Company
DUK: Duke Energy Corporation
EIX: Edison International
ETN: Eaton Corporation plc
EXC: Exelon Corporation
GILD: Gilead Sciences, Inc
GOOG: Alphabet Inc (Google)
GS: The Goldman Sachs Group, Inc
HD: The Home Depot, Inc
IBM: International Business Machines Corporation
INTC: Intel Corporation
JPM: JPMorgan Chase & Co
KO: The Coca-Cola Company
LMT: Lockheed Martin Corporation
MA: Mastercard Inc
MCD: McDonald's Corporation
MCO: Moody's Corporation
MET: MetLife, Inc
MO: Altria Group Inc
MRK: Merck & Co., Inc
MSFT: Microsoft Corporation
NEE: NextEra Energy Inc
NSC: Norfolk Southern Corporation
NVDA: NVIDIA Corporation
ORCL: Oracle Corporation
PCG: PG&E Corporation
PEP: PepsiCo, Inc
PFE: Pfizer Inc
PNC: The PNC Financial Services Group, Inc
PNW: Pinnacle West Capital Corporation
PPL: PPL Corporation
PXD: Pioneer Natural Resources Company
PYPL: PayPal Holdings, Inc
SO: Southern Company (The)
SRE: Sempra Energy
T: AT&T Inc
TRV: The Travelers Companies, Inc
UNH: UnitedHealth Group Inc
UNP: Union Pacific Corporation
USB: U.S. Bancorp
V: Visa Inc
VTR: Ventas, Inc
VZ: Verizon Communications Inc
WBA: Walgreens Boots Alliance, Inc
WEC: Wisconsin Energy Corporation
WFC: Wells Fargo & Company
XEL: Xcel Energy Inc
XOM: Exxon Mobil Corporation
XRX: Xerox Corporation
===========================================Thank You=====================================================
Name : Faakye Konadu Samuel
Position: Data Analyst
Email: konadufaakye@gmail.com
Tel: +233245938995
# Largest absolute loading for each principal component (row)
loadings_df.abs().max(axis="columns")